Abstract In this paper we introduce a micro-clustering strategy for Functional
Boxplots. The aim is to summarize a set of streaming time series split into
non-overlapping windows. It is a two-step strategy which first performs an on-line
summarization by means of functional data structures, named Functional Boxplot
micro-clusters; then it reveals the final summarization by processing these
functional data structures off-line. Our main contribution consists in providing a
new definition of micro-cluster based on Functional Boxplots and in defining a
proximity measure which allows us to compare and update them. This yields a finer
graphical summarization of the streaming time series by means of five basic
functional statistics of the data. The obtained synthesis is able to keep track of
the dynamic evolution of the multiple streams.

Key words: Time series data stream, Clustream, Micro-clustering, Functional Boxplot



1 Introduction 



Data stream mining has gained a lot of attention due to the development of
applications where sensor networks are used for monitoring physical quantities such
as electricity consumption, environmental variables and computer network traffic. In
these applications it is necessary to analyze potentially infinite flows of
temporally ordered observations which cannot be stored and which have to be
processed using reduced computational resources. The on-line nature of these data
streams requires the development of incremental learning methods which update the
knowledge about the monitored phenomenon every time a new observation is collected.

Elvira Romano
Department of European and Mediterranean Studies, Second University of Naples, Caserta, Italy,
e-mail: elvira.romano@unina2.it

Antonio Balzanella
Department of European and Mediterranean Studies, Second University of Naples, Caserta, Italy,
e-mail: antonio.balzanella@unina2.it

Among the exploratory tools for data stream processing, clustering methods are
widely used for knowledge extraction. In this framework, clustering methods are
used to deal with two problems. The first is to identify, from a set of data streams,
groups of streams having similar behavior. This is usually known as clustering of
time series data streams [6], and the main proposals include [13]. The second
problem is to group the observations that compose a data stream, or a set of data
streams, into homogeneous clusters (see, e.g., [2]). In this case, the observations
available at any given moment for the different streams constitute a p-dimensional
data point (where p is the number of data streams). Thus the aim of the on-line
algorithm is to find homogeneous groups of the recorded p-dimensional data points.

Usually these methods also perform the task of summarizing the observed data.
This is accomplished by identifying a set of centroids which provide the synthesis
of each homogeneous group of observations. Since the observations arriving on-line
are deleted after being processed, the type of synthesis adopted is a key point in
the development of methodologies for data stream clustering.

Following this second type of approach, the CluStream algorithm proposed in [2]
provides a two-step strategy. The first is an on-line step, named micro-clustering,
that performs a first summarization of the streams by keeping updated a specific
set of data structures (micro-clusters). The second is an off-line step, named
macro-clustering, which reveals the final summarization by processing the
micro-clusters with an appropriate clustering algorithm. CluStream provides only a
basic summarization of the data coming from sensors, since it only records the
average and the variance of groups of similar multidimensional items. In this paper
we extend this algorithm in order to use the Functional Boxplot introduced in [12]
as a tool for gaining knowledge from multiple streaming time series. This allows us
to obtain a finer summarization of the streaming time series which takes into
account five basic statistics of the data (first and third quartile, median, maximum
and minimum value) and which can be represented graphically.



2 CluStream of Functional Boxplots 

Let $y_i(t)$, $i = 1, \ldots, n$, $t \in [1, \infty]$, be a set of streaming time
series made of real-valued ordered observations of a variable $Y(t)$ at $n$ sites,
on a discrete time grid. This work proposes an incremental clustering algorithm
whose aim is to supply a set of data descriptions, or synopses, which reduce
dimensionality and keep track of the dynamic evolution of the streams. It processes
each example in constant time and memory, and it is incremental in the sense that
the data synopses are incrementally maintained as more and more data are received.

It is a CluStream algorithm on Functional Boxplots obtained from a set of $n$
streaming time series split into non-overlapping windows and suitably approximated
by functional data. The method can be summarized by the following steps:

• On-line phase (FBP-micro-clustering) 

- Splitting of incoming data streams into non overlapping windows; 

- Detection of the Functional Boxplot associated to each window; 

- Updating of appropriate data synopsis called Functional Boxplot micro-clusters. 

• Off-line phase (FBP-macro-clustering) 

- Clustering algorithm performed on the Functional Boxplot micro-clusters. 



2.1 On-line phase 

The first step of the on-line phase consists in splitting the incoming parallel
streaming time series into a set of non-overlapping windows $W_j$,
$j = 1, \ldots, \infty$, that are compact subsets of $T$ having size $w \in \Re$ and
such that $W_j \cap W_{j+1} = \emptyset$. The defined windows frame, for each
$y_i(t)$, a subset $y_i^{w_j}(t)$, $t \in W_j$, of ordered values of $y_i(t)$,
called a subsequence.
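The windowing step can be illustrated with a short Python sketch. It is only a minimal example under stated assumptions, not the authors' implementation: the name `iter_windows` is hypothetical, each incoming item is assumed to be the vector of the $n$ values observed at one time stamp, and an incomplete final window is simply not emitted.

```python
import numpy as np

def iter_windows(stream, w):
    """Yield non-overlapping windows of shape (n, w) from an iterable of
    length-n observation vectors (one value per site at each time stamp)."""
    buffer = []
    for obs in stream:
        buffer.append(np.asarray(obs, dtype=float))
        if len(buffer) == w:
            # rows = sites, columns = time stamps inside the window
            yield np.stack(buffer, axis=1)
            buffer = []  # raw data are dropped once the window is emitted

# Toy usage: n = 3 sites, 7 time stamps, windows of size w = 3
data = np.arange(21, dtype=float).reshape(7, 3)
windows = list(iter_windows(data, w=3))
```

Only the current window is held in memory, which is consistent with the requirement that the raw observations cannot be stored.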

Following the Functional Data Analysis approach [9], we consider each subsequence
$y_i^{w_j}(t)$ of the raw data, which includes noise information. Then we determine
a true functional form $f_i^{w_j}(t)$, which we call functional subsequence, that
describes the trend of the flowing data. For each $W_j$, all the subsequences
$y_i^{w_j}(t)$, $i = 1, \ldots, n$, follow the model:

$$y_i^{w_j}(t) = f_i^{w_j}(t) + \epsilon_i^{w_j}(t), \quad t \in W_j \qquad (1)$$

where $\epsilon_i^{w_j}(t)$ are independent zero-mean residuals and
$f_i^{w_j}(\cdot)$ is the mean function.
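As an illustration of model (1), the sketch below recovers a smooth functional form from the raw subsequences of one window. The text does not specify which smoother is used, so the least-squares polynomial fit, the degree, and the name `smooth_window` are assumptions made only for this example; a spline basis could be used in the same way.

```python
import numpy as np

def smooth_window(window, degree=3):
    """Return (fitted, residuals) for a window of shape (n, w).

    Each row is one raw subsequence; it is approximated by a least-squares
    polynomial of the given degree on a time grid rescaled to [0, 1].
    """
    window = np.asarray(window, dtype=float)
    t = np.linspace(0.0, 1.0, window.shape[1])
    # polyfit fits every column of its second argument, hence the transpose
    coefs = np.polynomial.polynomial.polyfit(t, window.T, degree)
    fitted = np.polynomial.polynomial.polyval(t, coefs)  # shape (n, w)
    return fitted, window - fitted
```

Because the fitted basis contains a constant term, the residuals of each subsequence average to zero, in line with the zero-mean assumption of the model.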

The second step of the on-line phase aims at detecting a summary of the set
$f_i^{w_j}(t)$ (with $i = 1, \ldots, n$) of the batched streaming time series by
means of functional boxplot variables $FBP_j$, $j = 1, \ldots, \infty$, defined as
follows:

Definition 1 (Functional Boxplot). Let $W_j$ be a window which frames the
subsequences $f_1^{w_j}(t), \ldots, f_i^{w_j}(t), \ldots, f_n^{w_j}(t)$ (with
$t \in W_j$). A Functional Boxplot $FBP_j$ is a compound of five functions
$\{f_{[u]}^{w_j}(t), f_{[l]}^{w_j}(t), f_{[1]}^{w_j}(t), f_{[ub]}^{w_j}(t), f_{[lb]}^{w_j}(t)\}$
where:

$f_{[u]}^{w_j}(t)$ is the upper bound of the central region;

$f_{[l]}^{w_j}(t)$ is the lower bound of the central region;

$f_{[1]}^{w_j}(t)$ is the median curve;

$f_{[ub]}^{w_j}(t)$ is the upper bound of the subsequences;

$f_{[lb]}^{w_j}(t)$ is the lower bound of the subsequences.
A Functional Boxplot is the analog of the classical boxplot for functional data [12].
The only difference lies in the data ordering criterion. In particular, since
functions vary over a continuum, data ordering is based on the notion of band depth
or modified band depth [8].

Based on the center-outward ordering induced by band depth for functional data,
the descriptive statistics of a functional boxplot are: the envelope of the 50%
central region, the median curve, and the maximum non-outlying envelope. The 50%
central region is the analog of the "interquartile range" (IQR): it is the band
delimited by the 50% deepest, that is the most central, observations. The border
of the 50% central region is the envelope which represents the box of a classical
boxplot. The median is the most central observation in the box. The maximum
envelope of the dataset, which corresponds to the vertical lines of the classical
plot, gives the "whiskers" of the boxplot. Formally, let $f_{[i]}^{w_j}(t)$ denote
the sample functional subsequence associated with the $i$th largest band depth
value. The set $f_{[1]}^{w_j}(t), \ldots, f_{[n]}^{w_j}(t)$ are order statistics,
with $f_{[1]}^{w_j}(t)$ the median curve, that is the most central curve (the
deepest), and $f_{[n]}^{w_j}(t)$ the most outlying curve. The central region of the
boxplot is defined as

$$C_{0.5} = \left\{ (t, y(t)) : \min_{r=1,\ldots,\lceil n/2 \rceil} f_{[r]}^{w_j}(t) \le y(t) \le \max_{r=1,\ldots,\lceil n/2 \rceil} f_{[r]}^{w_j}(t) \right\} \qquad (2)$$

where $\lceil n/2 \rceil$ is the smallest integer not less than $n/2$.
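The construction of the five curves can be sketched in Python as follows. This is a simplified illustration rather than the authors' code: it uses the modified band depth with bands formed by pairs of curves, computed through the usual rank-based shortcut (valid when there are no ties at a time point), and it takes the pointwise maximum and minimum over all curves as outer bounds, so no outlier rule is applied. The function names and the dictionary keys are assumptions.

```python
import math
import numpy as np

def modified_band_depth(curves):
    """Modified band depth (bands of two curves) for an (n, p) array of curves."""
    n, _ = curves.shape
    # rank of each curve at each time point, from 1 (lowest) to n (highest)
    ranks = curves.argsort(axis=0).argsort(axis=0) + 1
    below = ranks - 1
    above = n - ranks
    # share of curve pairs whose band contains the curve, averaged over time
    return ((below * above + n - 1) / math.comb(n, 2)).mean(axis=1)

def functional_boxplot(curves):
    """Return the five summary curves of a set of functional subsequences."""
    curves = np.asarray(curves, dtype=float)
    n = curves.shape[0]
    # center-outward ordering: deepest curve first
    order = np.argsort(-modified_band_depth(curves), kind="stable")
    central = curves[order[: math.ceil(n / 2)]]
    return {
        "median": curves[order[0]],
        "central_upper": central.max(axis=0),
        "central_lower": central.min(axis=0),
        "upper": curves.max(axis=0),
        "lower": curves.min(axis=0),
    }
```

The central region is the pointwise envelope of the $\lceil n/2 \rceil$ deepest curves, as in equation (2).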

In the third step of the on-line phase, the $FBP_j$ variables concur to update a
set of specific data structures $FBPc_k$, $k = 1, \ldots, K$, which we name
FBP-micro-clusters, defined as:

Definition 2 (Functional Boxplot Microcluster). An FBP-micro-cluster $FBPc_k$,
$k = 1, \ldots, K$, for a set of functional boxplots $FBP_j$ (with
$j = 1, \ldots, n^k$) is the tuple $(\overline{FBP}_k, n^k, tl^k, th)$ where:

• $\overline{FBP}_k$ is the functional boxplot which assumes the role of centroid;

• $n^k$ is the number of allocated functional boxplots;

• $tl^k$ is the time stamp of the last update;

• $th$ is a boundary value.

The Functional Boxplot micro-cluster is an extension of the micro-cluster
introduced in [2]. In our method, its task is to summarize very similar Functional
Boxplots through a set of statistics which can be updated on-line and which adapt
to changes in the data.

In order to achieve the desired space saving, we keep a set of $FBPc_k$ with
$k = 1, \ldots, K$, where $K$ is chosen so as to keep the data well represented.
Thus $K$ is much higher than the number of clusters in the data but much lower than
the number of processed windows.

In the on-line step, every time the data of a new window $W_j$ become available, an
$FBP_j$ is constructed and then allocated to an $FBPc_k$. The allocation is obtained
by evaluating the distance between $FBP_j$ and each centroid $\overline{FBP}_k$: if
the minimum distance is lower than the threshold value $th$ stored in the
micro-cluster, $FBP_j$ is allocated to the corresponding $FBPc_k$; otherwise a new
micro-cluster is started, setting the functional boxplot of the window as its
centroid and $n^k = 1$.

The allocation is based on the definition of an appropriate distance measure for
comparing the $FBP_j$. It is computed by considering that each pair of
corresponding functions is compared on the same time interval, by means of an
alignment of the $FBP_j$.

Let us consider two functional boxplots $FBP_j$, $FBP_{j'}$ defined on two windows
$W_j$, $W_{j'}$. Each of them is characterized by its set of five functions,
$f^{w_j}(t) : W_j \to \Re$ for $FBP_j$ and $f^{w_{j'}}(t) : W_{j'} \to \Re$ for
$FBP_{j'}$.

Aligning $FBP_{j'}$ to $FBP_j$ means finding a function $g(t) : W_{j'} \to W_j$
such that $f^{w_j}(t)$ and $g \circ f^{w_{j'}}(t) = h^{w_j}(t)$ are defined on the
same interval $W_j$, with the function $g(t)$ expressed by $g(t) = a + bt$.

We consider $a \in \Re$ and $b = 1$, that is, a simple shift alignment. If
$b \neq 1$, that is for functions which are not only misaligned but also warped,
$g(t)$ can be considered a warping function as in [11].

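Thus, formally, the distance between a pair of functional boxplots $FBP_j$,
$FBP_{j'}$ is defined as follows:

Definition 3 (Distance). Let
$FBP_j = \{f_{[u]}^{w_j}(t), f_{[l]}^{w_j}(t), f_{[1]}^{w_j}(t), f_{[ub]}^{w_j}(t), f_{[lb]}^{w_j}(t)\}$
and
$FBP_{j'} = \{f_{[u]}^{w_{j'}}(t), f_{[l]}^{w_{j'}}(t), f_{[1]}^{w_{j'}}(t), f_{[ub]}^{w_{j'}}(t), f_{[lb]}^{w_{j'}}(t)\}$
be two functional boxplots defined, respectively, on $W_j$ and $W_{j'}$, and let
$g(t)$ be the alignment function so that
$g \circ f_{[\cdot]}^{w_{j'}}(t) = h_{[\cdot]}^{w_j}(t)$. The distance between
$FBP_j$ and $FBP_{j'}$ is:

$$d(FBP_j, FBP_{j'}) = \sqrt{\int_{W_j} \left(f_{[u]}^{w_j}(t) - h_{[u]}^{w_j}(t)\right)^2 dt} + \sqrt{\int_{W_j} \left(f_{[l]}^{w_j}(t) - h_{[l]}^{w_j}(t)\right)^2 dt} + \cdots$$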
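A numerical version of such a distance can be sketched as follows. The sketch assumes that the distance is the sum, over the five pairs of corresponding curves, of their $L_2$ distances, that each functional boxplot is stored as a (5, w) array sampled on a regular grid of step `dt`, and that with $b = 1$ the alignment only shifts the time origin, so that aligned curves can be compared sample by sample. The function name and the array layout are hypothetical.

```python
import numpy as np

def fbp_distance(fbp_a, fbp_b, dt=1.0):
    """Sum over the five curves of the L2 distance between corresponding functions.

    fbp_a, fbp_b: arrays of shape (5, w) holding the five curves of two
    functional boxplots, already aligned on a common grid of step dt.
    """
    sq = (np.asarray(fbp_a, dtype=float) - np.asarray(fbp_b, dtype=float)) ** 2
    # trapezoidal rule for the integral of the squared difference of each curve
    integrals = ((sq[:, 1:] + sq[:, :-1]) / 2.0).sum(axis=1) * dt
    return float(np.sqrt(integrals).sum())
```

For two boxplots whose five curves differ by a constant 1 over a window with three unit intervals, each term equals $\sqrt{3}$ and the distance is $5\sqrt{3}$.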

The consequences of an allocation are the unit increment of $n^k$, the setting of
the parameter $tl^k$ to the current time stamp, and the update of the
FBP-micro-cluster centroid. The latter is performed so that, for each of the five
functions which define the Functional Boxplot, the average is kept. This can be
obtained from the information stored in the FBP-micro-cluster itself and from the
newly allocated Functional Boxplot.
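The allocation and update rules can be sketched together in Python. This is a minimal illustration under assumptions: functional boxplots are (5, w) arrays, the distance is a discretized sum of Euclidean distances between corresponding curves, the threshold `th` is passed as an argument instead of being stored in each micro-cluster, and the names `FBPMicroCluster` and `allocate` are hypothetical.

```python
from dataclasses import dataclass
import numpy as np

def fbp_distance(a, b):
    """Assumed distance: sum of the Euclidean distances between the five curve pairs."""
    return float(np.sqrt(((a - b) ** 2).sum(axis=1)).sum())

@dataclass
class FBPMicroCluster:
    centroid: np.ndarray  # shape (5, w): average of the allocated boxplots
    n: int                # number of allocated functional boxplots
    tl: int               # time stamp of the last update

def allocate(clusters, fbp, t_now, th):
    """Allocate fbp to the nearest micro-cluster if closer than th,
    otherwise start a new micro-cluster. Returns the index used."""
    fbp = np.asarray(fbp, dtype=float)
    if clusters:
        dists = [fbp_distance(c.centroid, fbp) for c in clusters]
        k = int(np.argmin(dists))
        if dists[k] < th:
            c = clusters[k]
            # running average: only the old centroid and its count are needed
            c.centroid = (c.n * c.centroid + fbp) / (c.n + 1)
            c.n += 1
            c.tl = t_now
            return k
    clusters.append(FBPMicroCluster(centroid=fbp.copy(), n=1, tl=t_now))
    return len(clusters) - 1
```

The running average shows why the allocated boxplots do not need to be kept: the old centroid and the count $n^k$ are enough to compute the new centroid.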

In our method, the size $K$ of the set of FBP-micro-clusters is not defined a
priori but adapts to the structure of the data; however, it strongly depends on the
choice of the threshold $th$. Too high a value means that only a few
FBP-micro-clusters are generated; on the contrary, too low a value generates too
many FBP-micro-clusters. To deal with this issue we introduce a heuristic to set
the value of the threshold and a criterion to keep the number of functional boxplot
micro-clusters under a value $K_{max}$ (this allows us to keep a constant upper
bound on the memory space used).

In particular, we propose to compute the threshold $th$ as follows:

$$th = \min d(\overline{FBP}_j, \overline{FBP}_k) \quad \forall j, k = 1, \ldots, K \text{ with } k \neq j \qquad (3)$$

thus, $th$ is set to the minimum distance between the FBP-micro-cluster centroids.
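The threshold heuristic amounts to a minimum over all pairs of centroids, as in the following sketch. The (5, w) array layout, the discretized distance and the function names are assumptions made for the example, and at least two centroids are required.

```python
from itertools import combinations
import numpy as np

def fbp_distance(a, b):
    """Assumed distance: sum of the Euclidean distances between the five curve pairs."""
    return float(np.sqrt(((a - b) ** 2).sum(axis=1)).sum())

def threshold(centroids):
    """Minimum pairwise distance between micro-cluster centroids (needs K >= 2)."""
    return min(fbp_distance(a, b) for a, b in combinations(centroids, 2))
```

The computation costs $K(K-1)/2$ distance evaluations, which stays bounded as long as the number of micro-clusters is bounded.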

If the number of FBP-micro-clusters grows so much as to exceed the available
memory resources, we propose either to discard the micro-clusters recording
concepts no longer present in the data or to merge the two nearest
FBP-micro-clusters into one. The choice is made by evaluating the time stamp of the
last update, stored in the parameter $tl^k$ of each FBP-micro-cluster:

$$\begin{cases} \text{If } (t_{now} - tl^k) > t^* \quad \forall k = 1, \ldots, K & \Rightarrow \text{Discard } FBPc_k \\ \text{Else } \arg\min_{j,k} d(\overline{FBP}_j, \overline{FBP}_k) \quad \forall j, k = 1, \ldots, K \text{ with } k \neq j & \Rightarrow \text{Merge } \overline{FBP}_j, \overline{FBP}_k \end{cases}$$

where $t_{now}$ is the time stamp of the current window and $t^*$ indicates the age
beyond which an FBP-micro-cluster has to be considered no longer useful.
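A possible reading of this rule is sketched below. The text does not state how two micro-clusters are merged, so the count-weighted average of the centroids, the sum of the counts and the most recent time stamp are assumptions of this example, as are the dictionary layout and the name `shrink`.

```python
from itertools import combinations
import numpy as np

def fbp_distance(a, b):
    """Assumed distance: sum of the Euclidean distances between the five curve pairs."""
    return float(np.sqrt(((a - b) ** 2).sum(axis=1)).sum())

def shrink(clusters, t_now, t_star):
    """clusters: list of dicts with keys 'centroid' (5, w), 'n', 'tl'.
    Discard outdated micro-clusters if any, else merge the two nearest ones."""
    fresh = [c for c in clusters if t_now - c["tl"] <= t_star]
    if len(fresh) < len(clusters) or len(clusters) < 2:
        return fresh
    # no outdated micro-cluster: find the pair of nearest centroids
    i, j = min(
        combinations(range(len(clusters)), 2),
        key=lambda p: fbp_distance(clusters[p[0]]["centroid"],
                                   clusters[p[1]]["centroid"]),
    )
    a, b = clusters[i], clusters[j]
    merged = {
        # assumed merge rule: average of the centroids weighted by the counts
        "centroid": (a["n"] * a["centroid"] + b["n"] * b["centroid"]) / (a["n"] + b["n"]),
        "n": a["n"] + b["n"],
        "tl": max(a["tl"], b["tl"]),
    }
    return [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
```

Either branch reduces the number of micro-clusters by at least one, so calling it whenever $K_{max}$ is exceeded keeps memory usage bounded.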



2.2 Off-line phase 

In order to reveal the final summarization of the streams, the off-line phase
analyzes the FBP-micro-clusters computed on-line. We provide a method to obtain the
summarization of the data behavior over user-defined time slots. It is based on
storing, at predefined time instants, a snapshot of the set of FBP-micro-clusters.
Each snapshot collects the state of each $FBPc_k$ at that time instant.

In order to get the summarization of the user-defined time slot, the procedure
identifies the snapshot that is temporally closest to the lower end of the time
interval (lower snapshot) and the one which is temporally closest to the upper end
(upper snapshot). The next step is to remove from the state of the functional
boxplot micro-clusters the effects of the updates that occurred before the lower
snapshot. Since the centroid $\overline{FBP}_k$ of each FBP-micro-cluster is the
average of the allocated functional boxplots, it is possible to recover the state
of each $FBPc_k$ net of what happened before the beginning of the time slot by
computing a component-by-component weighted difference between the centroid
$\overline{FBP}_k$ available from the upper snapshot and the corresponding
$\overline{FBP}_k$ obtained from the lower snapshot (the weights are the numbers of
allocations stored in the parameter $n^k$).
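Since each centroid is an average, the weighted difference between two snapshots can be sketched as below. The sketch assumes that snapshots are dictionaries mapping a micro-cluster identifier to a pair (centroid, count) and that micro-clusters keep their identifier between snapshots; merged or discarded micro-clusters are not handled. The name `slot_summary` is hypothetical.

```python
import numpy as np

def slot_summary(lower, upper):
    """lower, upper: dicts id -> (centroid of shape (5, w), count n).
    Returns id -> (centroid, count) restricted to the updates that
    occurred between the two snapshots."""
    out = {}
    for key, (c_up, n_up) in upper.items():
        # a micro-cluster created after the lower snapshot has no earlier state
        c_lo, n_lo = lower.get(key, (0.0, 0))
        n_slot = n_up - n_lo
        if n_slot > 0:  # skip micro-clusters not updated inside the slot
            out[key] = ((n_up * c_up - n_lo * c_lo) / n_slot, n_slot)
    return out
```

For example, a micro-cluster that had averaged two boxplots at levels 1 and 2 before the lower snapshot and then received boxplots at levels 4 and 6 has an upper centroid of 3.25 with count 4; the weighted difference $(4 \cdot 3.25 - 2 \cdot 1.5)/2 = 5$ recovers the average of the two boxplots allocated inside the slot.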

From the output of the previous step, the obtained centroids $\overline{FBP}_k$,
together with the number of allocated items (which assumes the role of weight),
become the data to be processed by a k-means-like algorithm which provides, as
output, a partition of the FBP-micro-cluster centroids into a set
$MC_1, \ldots, MC_c, \ldots, MC_C$ (with $C < K$) of macro-clusters and a new set
$\overline{FBP}_c$ (with $c = 1, \ldots, C$) of functional boxplots which are the
final summaries of the required time interval. Similarly to k-means, this
algorithm minimizes an internal heterogeneity measure:

$$\Delta = \sum_{c=1}^{C} \sum_{k \in MC_c} d(\overline{FBP}_k; \overline{FBP}_c)\, n^k \qquad (4)$$

where $d(\overline{FBP}_k, \overline{FBP}_c)$ is computed according to Definition 3.

In order to optimize criterion (4), our macro-clustering algorithm iterates, until
convergence, an allocation step and a centroid computation step. In the allocation
step, each $\overline{FBP}_k$ is attributed to the macro-cluster whose distance is
minimal. In the centroid computation step, the representation of each macro-cluster
$MC_c$ is obtained by means of a component-by-component weighted average, where
$n^k$ is the weight for the corresponding $\overline{FBP}_k$.
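The two alternating steps can be sketched in Python as follows. This is an illustrative sketch rather than the authors' implementation: centroids are stored in a (K, 5, w) array, the distance is a discretized sum of Euclidean distances between corresponding curves, the initial macro-centroids are chosen by the caller through the hypothetical `init` argument, and an empty macro-cluster simply keeps its previous centroid.

```python
import numpy as np

def fbp_distance(a, b):
    """Assumed distance: sum of the Euclidean distances between the five curve pairs."""
    return float(np.sqrt(((a - b) ** 2).sum(axis=1)).sum())

def macro_cluster(centroids, weights, init, max_iter=100):
    """centroids: (K, 5, w) micro-cluster centroids; weights: (K,) counts n^k;
    init: indices of the C micro-cluster centroids used as starting points.
    Returns (labels, macro_centroids)."""
    centroids = np.asarray(centroids, dtype=float)
    weights = np.asarray(weights, dtype=float)
    macro = centroids[list(init)].copy()
    labels = None
    for _ in range(max_iter):
        # allocation step: nearest macro-centroid for each micro-cluster centroid
        new_labels = np.array(
            [int(np.argmin([fbp_distance(x, m) for m in macro])) for x in centroids]
        )
        if labels is not None and np.array_equal(new_labels, labels):
            break  # the partition no longer changes
        labels = new_labels
        # centroid step: component-by-component average weighted by n^k
        for c in range(len(macro)):
            members = labels == c
            if members.any():
                macro[c] = np.average(centroids[members], axis=0,
                                      weights=weights[members])
    return labels, macro
```

As with k-means, the result depends on the starting points, so in practice several initializations may be compared through the value of criterion (4).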



3 Daily Rainfall Monitoring by Clustering of FBP 

This section shows the results of the proposed method on real data. We have
analyzed a dataset provided by the Australian Government - Bureau of Meteorology,
available on-line at http://www.bom.gov.au/climate/data/, which records the
daily rainfall in Australia from 1/4/1961 to 30/4/2012. We have downloaded 77
time series, each one made of 15139 observations and corresponding to a weather
station located in the Australian region. The observation period and the weather
stations have been selected so as to have no missing data. Precipitation is most
often rain, but also includes other forms such as snow. Observations of daily
rainfall are nominally made at 9 am local clock time and record the total
precipitation for the preceding 24 hours. If, for some reason, an observation cannot
be made, the next observation is recorded as an accumulation, since the rainfall has
been accumulating in the rain gauge since the last reading. As can be seen from
Fig. 1, daily rainfall is characterized by intense variations. The highest values of
the mean precipitation reached in the fifteen days of the first window could seem
comparable with the maxima of the twenty-five days of the second window. However,
they do not come from the same daily rainfall stream but from streams related to
different stations. In this sense, the overall trend of the phenomenon cannot be
detected. In the following we show how our method can help to capture the main
rainfall behaviors along the whole observation period and to describe and
graphically represent them by means of a set of Functional Boxplots.

The assessment of the method requires setting two input parameters: the size $w$ of
each window and the maximum number $K_{max}$ of generated $FBPc_k$ micro-clusters.
We set the first one to $w = 30$ in order to get on-line computed functional
boxplots summarizing thirty days of observations. The second parameter has been set
to $K_{max} = 50$, which represents a good compromise between the detail of
summarization and the memory usage.
Fig. 1 Daily Rainfall in two different time windows made by 30 observations.

Table 1 The number of on-line computed FBP allocated to each $FBPc_k$



From the results (see Tab. 1), we can observe that seven $FBPc_k$ collect more than
5 on-line computed functional boxplots, so these are the ones that record the main
concepts in the data. The remaining FBP-micro-clusters summarize the anomalous
or residual rainfall behaviors.

The off-line procedure, which is performed taking as input the whole set of
$FBPc_k$, provides a final summarization of the data. We are interested in
discovering how the overall trend changes over the days and whether there are
dominant structures in the data behaviors. Thus, we choose to get a summarization
made of four final functional boxplots (Fig. 2).

Comparing the original curves to the four functional boxplots, we see that the
latter are very informative in highlighting the main changes in the data. In all
four cases, the curve distributions are asymmetric and positively skewed. The four
functional boxplots differ mostly in the median curve, which can be interpreted as
the most representative observed pattern of rainfall data, and in the central
region, which gives a less biased visualization of the curves' spread.

In the first Functional Boxplot, the median curve is characterized by a low and
oscillating rainfall trend around 1.5mm, with higher values between the 23rd and
27th day. In this case, more information is obtained by observing the box: it
highlights that, within the last 10 days, rainfall is high on the 23rd and 25th
days.
By contrast, the second Functional Boxplot depicts a quasi-constant rainfall trend
around 3mm (the median curve), with a similar shape of the box but with higher
rainfall values. This indicates that rainfall varies with a constant trend between
5mm and 7.5mm. The third Functional Boxplot instead shows lower values of the
rainfall median curve with the highest values of the box bounds (the values vary
between 18mm and 20mm). Finally, the fourth Functional Boxplot highlights a median
rainfall near zero, except for the 20th and 25th days, and a box with a
concentration of rainfall curves in the third quartile with high variability.

All four Functional Boxplots have an envelope bounded by the blue curves, whose
minimum value corresponds to an absence of rainfall. Thus, the lower curve
coincides with the x-axis. The upper curve, indicating the maximum value of
rainfall, is characterized by four different behaviors, evidently linked to the
different periods summarized. In the first Functional Boxplot, we observe a curve
with an almost constant trend, with values oscillating around 30mm for the first
twenty days and up to 80mm in the other 10 days. In the second and third boxplots,
on the contrary, the trends vary around the values of 20mm and 55mm. Finally, the
fourth boxplot shows oscillating values significantly higher than 100mm, with a
peak of volatility between the 20th and 25th day.




Fig. 2 Functional Boxplots summarization for Daily Rainfall Data with four Microclusters. The
blue curves denote the envelope. The magenta area delimits the 50% central region. The yellow
curve represents the median curve.

4 Concluding remarks

In this paper we have introduced a new CluStream strategy for multiple streaming
time series. It is based on a two-step process to handle incremental time series. In
the first step (the on-line step), continuously updated graphical summarizing
structures, named Functional Boxplots, are detected. In the second step (the
off-line step), a final graphical summarization of the flowing data is obtained.

Unlike the existing CluStream strategies in the streaming time series literature,
we have introduced a tool which is also able to provide a graphical synthesis.